Abstract

We report on a recently initiated project which aims at building a multi-layered parallel treebank of English and German. Particular 
attention is devoted to a dedicated predicate-argument layer which is used for aligning translationally equivalent sentences of the two 
languages. We describe both our conceptual decisions and aspects of their technical realisation. We discuss some selected problems and 
conclude with a few remarks on how this project relates to similar projects in the field. 



1. Introduction 

Parallel corpora are widely accepted as a valuable data 
source for machine translation and other research. So far, 
however, the amount of linguistic annotation in these cor- 
pora is limited, and particularly multilingual corpora an- 
notated with syntactic information are rare. Our goal is 
to build a treebank of aligned parallel 1 texts in English 
and German with the following linguistic levels: POS tags, 
constituent structure, functional relations and predicate- 
argument structure for each monolingual subcorpus, plus 
an alignment layer to "fuse" the two - hence our working 
title for the treebank, FuSe, which additionally stands for 
functional semantic annotation (Cyrus et al., 2003).

We use the Europarl Corpus (Koehn, 2002), which contains
sentence-aligned proceedings of the European Parliament
in eleven languages and thus offers ample opportunity
for extending the treebank at a later stage. 2 For syntactic 
and functional annotation we basically adapt the TIGER annotation
scheme (Albert et al., 2003), making adjustments
where we deem appropriate and changes which become 
necessary when adapting to English an annotation scheme 
which was originally developed for German. 

The fusion of the language pair will take place on 
an alignment layer which connects the predicate-argument 
layers of both monolingual subcorpora. Only the alignment 
layer is explicitly defined for a language pair rather than for 
a single language. Apart from this layer, the subcorpora are 
monolingual resources in their own right. 

Although, eventually, the treebank will prove useful for 
several fields of application, the most obvious one being 
machine translation, our main motivation is to contribute to 
linguistic research. The treebank will serve as a resource 
for both monolingual and contrastive analyses. 



1 In accordance with the terminology suggested in
(Sinclair, 1994), we understand "parallel" to mean that the
texts are translations of each other. 

2 There are a few drawbacks to Europarl, such as its limited 
register and the fact that it is not easily discernible which language 
is the source language. However, we believe that at this stage the 
easy accessibility, the amount of preprocessing and particularly 
the lack of copyright restrictions make up for these disadvantages. 



2. Reasons for Predicate-Argument Structure

In a parallel treebank, it is necessary to capture the 
translational equivalence between two sentences. Our basic 
assumption is that this equivalence can best be represented 
by means of a predicate-argument structure. It is some- 
times assumed that predicate-argument structure can be de- 
rived or recovered from constituent structure or functional 
tags such as subject and object. 3 While it is true that these 
annotations provide important heuristic clues for the iden- 
tification of predicates and arguments, predicate-argument 
structure goes beyond the assignment of phrasal categories 
and grammatical functions, because the grammatical cate- 
gory of predicates and consequently the grammatical func- 
tions of their arguments can vary. 

For instance, it is very common for an English verbal 
predicate to be expressed by a nominalisation in German, as 
is the case in the NPs in (1) and (2), where the English verb
nominate is translated as the German noun Nominierung. 

(1) their automatic right to nominate a member of the 
European Commission 4 

(2) ihr automatisches Recht auf Nominierung eines 
their automatic right on nomination of_a 
Mitglieds der Europäischen Kommission
member of_the European Commission 

The annotations of these noun phrases are shown in Figure 1. 5
It can be seen that the correspondence between
NP508 and NP505 cannot be inferred from the constituent 
structure, since NP508 is an immediate constituent of an IE 
("extended infinitive") while NP505 is deeply embedded in a 
PP. Neither can the correspondence of NP508 and NP505 be 
inferred from their respective functional categories, since 
NP508 is a direct object (OD) while NP505 is a modifier (AG: 
"genitive attribute"). However, the resemblance between 
these constituents becomes apparent when they are marked 
for their argument status, because they both fulfill a similar 
role. 



3 See e.g. (Marcus et al., 1994).

4 Europarl:de-en/ep-00-02-15.al, 326. Note that throughout 
this paper, sentences are sometimes cited with irrelevant parts 
omitted. 

5 All figures are at the end of the paper. 



We have therefore chosen to represent predicate- 
argument structure on a dedicated layer in our treebank in 
order to be able to capture the parallelism between transla- 
tions and to use it as the basis for alignment. 

3. Details of the Predicate-Argument Annotation

The predicate-argument structures used here consist 
solely of predicates and their arguments. Although there is 
usually more than one predicate in a sentence, no attempt is 
made to nest structures or to join the predications logically 
in any way. 6 The idea is to make the predicate-argument 
structure as rich as is necessary to be able to align a sen- 
tence pair while keeping it as simple as possible so as not to 
make it too difficult to annotate. In the same vein, quantifi- 
cation, negation, and other operators are not annotated. In 
short, the predicate-argument structures are not supposed 
to capture the semantics of a sentence exhaustively in an 
interlingua-like fashion. 

3.1. Predicates and Arguments 

In determining what a predicate is and how many there 
are in a sentence we rely on a few assumptions that are of 
a heuristic nature. One of these assumptions is that predi- 
cates are more likely to be expressed by tokens belonging 
to some word classes than by tokens belonging to others. 
Potential predicate expressions in FuSe are verbs, deverbal 
adjectives and nouns 7 or other adjectives and nouns which 
show a syntactic subcategorisation pattern. The predicates 
are represented by the capitalised citation form of the lexi- 
cal item (e. g. NOMINATE). Homonymous or polysemous 
predicates are differentiated by means of a disambigua- 
tor, predicates are assigned a class based on their syntactic 
form, and derivationally related predicates form a predicate 
group. 

Arguments are given short intuitive role names (e.g.
ENT_NOMINATED) in order to facilitate the annotation process.
These role names have to be used consistently only
within a predicate group. If, for example, an argument
of the predicate NOMINATE has been assigned the role
ENT_NOMINATED and the annotator encounters a comparable
role as argument to the predicate NOMINATION, the
same role name for this argument has to be used.

Keeping the argument names consistent for all predi- 
cates within a group while differentiating the predicates on 
the basis of syntactic form are complementary principles, 
both of which are supposed to facilitate querying the cor- 
pus. The consistency of argument names within a group, 
for example, enables the researcher to analyse paradigmati- 
cally all realisations of an argument irrespective of the syn- 
tactic form of the predicate. At the same time, the differen- 
tiation of predicates makes possible a syntagmatic analysis 

6 Since the predicate-argument structure is always bound to the 
constituent structure (see Section 3.2), it might well be possible to
derive this information, e. g. through coordination structures and 
the hierarchical ordering of constituents. 

7 For all non-verbal predicate expressions for which a deriva- 
tionally related verbal expression exists it is assumed that they 
are deverbal derivations, etymological counter-evidence notwith- 
standing. 



of the differences of argument structures depending on the 
syntactic form of the predicate. 
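
The two principles can be made concrete with a small sketch. It is only an illustration under stated assumptions: the class and field names (Predicate, Argument, pred_class, group) and the phrasal categories in the sample data are hypothetical and are not taken from the FuSe database documentation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    lemma: str               # capitalised citation form, e.g. "NOMINATE"
    pred_class: str          # class based on syntactic form (assumed: "v", "n")
    group: str               # predicate group of derivationally related items
    disambiguator: int = 1   # separates homonymous or polysemous predicates


@dataclass(frozen=True)
class Argument:
    predicate: Predicate
    role: str                # short intuitive role name
    realisation: str         # phrasal category of the bound constituent


def realisations_by_role(arguments, group):
    """Paradigmatic view: all realisations of each role within one group."""
    table = defaultdict(set)
    for arg in arguments:
        if arg.predicate.group == group:
            table[arg.role].add((arg.predicate.pred_class, arg.realisation))
    return dict(table)


# Illustrative data only; the categories NP and PP are assumptions.
nominate = Predicate("NOMINATE", "v", group="NOMINATE")
nomination = Predicate("NOMINATION", "n", group="NOMINATE")
arguments = [
    Argument(nominate, "ENT_NOMINATED", "NP"),
    Argument(nomination, "ENT_NOMINATED", "PP"),
]
print(realisations_by_role(arguments, "NOMINATE"))
```

Because the role name is shared across the group, one lookup returns every realisation of ENT_NOMINATED, while the predicate class stored alongside each realisation keeps the verbal and nominal patterns apart for a syntagmatic comparison.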

3.2. Binding Layer 

All elements of the predicate-argument structure must 
be bound to elements of the phrasal structure (terminal or 
non-terminal nodes). These bindings are stored in a ded- 
icated binding layer between the constituent layer and the 
predicate-argument layer. 

When an expected argument is absent on the phrasal 
level due to specific syntactic constructions, the binding of 
the predicate is tagged accordingly, thus accounting for the 
missing argument. For example, in passive constructions 
like in Table 1, the predicate binding is tagged as pv. Other
common examples are imperative constructions. Although 
information of this kind may possibly be derived from the 
constituent structure, it is explicitly recorded in the binding 
layer as it has a direct impact on the predicate-argument 
structure. 



Sentence   wenn  korrekt    gedolmetscht  wurde
Gloss      if    correctly  interpreted   was
                            ↑
Binding                     pv
                            |
Pred/Arg                    DOLMETSCHEN

Table 1: Example of a tagged predicate binding 
(Europarl:de-en/ep-00-01-18.al, 2532) 

Bindings of arguments may be tagged as well, an example
for this being object-control (cf. Table 2). To account
for the deviant case of the subject of the embedded clause in 
an object-control construction, the binding of this argument 
is tagged (oc-case). With this information, a researcher 
or a machine learner will be able to ignore a specific argu- 
ment which might distort statistics on the phrasal realisa- 
tions of arguments. 

The predicate binding is tagged as well to mark the en- 
tire object-control construction (oc). This tagging enables 
the researcher to filter out this specific predicate-argument 
structure, so as to ignore these constructions completely. 
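
The filtering described above might look as follows. This is a minimal sketch, not the project's query interface: the classes Binding and Predication and both filter functions are hypothetical, and the node labels simply reuse the words of the examples in Tables 1 and 2.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Binding:
    element: str                   # predicate or argument role name
    node: str                      # bound constituent (illustrative label)
    tags: frozenset = frozenset()  # e.g. {"pv"}, {"oc"}, {"oc-case"}


@dataclass
class Predication:
    predicate: Binding
    arguments: list = field(default_factory=list)


def without_constructions(predications, excluded=frozenset({"oc"})):
    """Drop whole predications whose predicate binding carries an excluded tag."""
    return [p for p in predications if not (p.predicate.tags & excluded)]


def untagged_arguments(predication, excluded=frozenset({"oc-case"})):
    """Keep only arguments whose binding does not carry an excluded tag."""
    return [a for a in predication.arguments if not (a.tags & excluded)]


propose = Predication(
    Binding("PROPOSE", "propose", frozenset({"oc"})),
    [
        Binding("PROPOSER", "us", frozenset({"oc-case"})),
        Binding("PROPOSAL", "the same thing"),
    ],
)
dolmetschen = Predication(
    Binding("DOLMETSCHEN", "gedolmetscht", frozenset({"pv"}))
)

print([p.predicate.element for p in without_constructions([propose, dolmetschen])])
print([a.element for a in untagged_arguments(propose)])
```

Keeping the tags on the bindings rather than on the predicate-argument structure itself means that the same predication can be included or skipped depending on the question being asked of the corpus.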

Section 4.1 will show that linking predicates or arguments
to constituents cannot always be achieved by binding
them to a single node in the constituent structure. In
order to be flexible in this respect, the binding layer al- 
lows for complex bindings, with more than one node of 
the constituent structure to be included in and sub-nodes 
to be explicitly excluded from a binding to a predicate or 
argument. 8 

3.3. Alignment Layer 

On the alignment layer, the elements of a pair of 
predicate-argument structures are aligned with each other. 
Arguments are aligned on the basis of corresponding roles 
within the predications. Comparable to the tags used in the 
binding layer that account for specific constructions (see 



8 See the database documentation (Feddes, 2004) for a more
detailed description of this mechanism. 



Sentence  It was this which inspired  us        to  propose  the same thing  with regard to state aid .
                                      ↑             ↑        ↑
Binding                               oc-case       oc       []
                                      |             |        |
Pred/Arg                              PROPOSER      PROPOSE  PROPOSAL



Table 2: Example of tagged predicate and argument bindings (Europarl:de-en/ep-00-01-18.al, 237) 



Section 3.2), the alignments may also be tagged with further
information. This becomes necessary when the predications
are incompatible in some way. Section 4.3 will
give examples.

If there is no corresponding predicate-argument struc- 
ture in the other language or if an argument within a struc- 
ture does not have a counterpart in the other language, there 
will simply be no alignment. Section 4.2 provides an example
where a predication is left dangling.

Table 3 gives an overview of the annotation layers as
described in this section. 



Layer      Function

Phrasal    constituent structure of language A
Binding    binding ↓ predicates/arguments to ↑ nodes
PA         predicate-argument structures
Alignment  aligning ↕ predicates and arguments
PA         predicate-argument structures
Binding    binding ↑ predicates/arguments to ↓ nodes
Phrasal    constituent structure of language B



Table 3: The layers of the predicate-argument annotation 



4. Problematic Cases 

In this section we will elaborate on some problematic 
cases of predicate-argument annotation which we have en- 
countered so far, some of them particular to the annotation 
and alignment of predicate-argument structures for a lan- 
guage pair. 

4.1. Binding Predicate-Argument Structure to 
Constituent Structure 

It was mentioned in Section 3.2 that all predicates and arguments
must be bound to either terminal or non-terminal
nodes in the constituent structure. However, this is not always
possible since in some cases there is no direct correspondence
between argument roles and constituents. For
instance, this problem occurs whenever a noun is postmodified
by a participle clause: in Figure 2 the argument role
ENT_RAISED of the predicate RAISE is realised by NP525,
but the participle clause (IPA517) containing the predicate
(raised) needs to be excluded, because not excluding it
would lead to recursion. Consequently, there is no simple
way to link the argument role to its realisation in the tree.

In these cases we link the argument role to the appro- 
priate phrase (here: NP525) and prune out the constituent 
that contains the predicate (IPA517; see Section 3.2 for this
mechanism), which results in a discontinuous argument re- 
alisation. 
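
The pruning mechanism can be sketched as a set difference over terminals. The following is an illustration only: the node labels NP525 and IPA517 are taken from the text, but the toy tree, its words, the extra node PP1 and the function names are invented for the example and do not reproduce the sentence in Figure 2 or the actual database mechanism.

```python
# Toy constituent tree: non-terminal label -> ordered children.
# A child is either another non-terminal label or a (position, word) terminal.
TREE = {
    "NP525": [(0, "the"), (1, "points"), "IPA517", "PP1"],
    "IPA517": [(2, "raised"), (3, "yesterday")],
    "PP1": [(4, "about"), (5, "funding")],
}


def terminals(node):
    """Return the set of terminals dominated by a node."""
    if isinstance(node, tuple):
        return {node}
    covered = set()
    for child in TREE[node]:
        covered |= terminals(child)
    return covered


def binding_yield(included, excluded=()):
    """Terminals of all included nodes minus those of all excluded sub-nodes."""
    keep = set().union(*(terminals(n) for n in included))
    drop = set().union(*(terminals(n) for n in excluded))
    return sorted(keep - drop)


def is_discontinuous(terminal_list):
    """True if the sorted terminals leave a gap in the token positions."""
    positions = [pos for pos, _ in terminal_list]
    return any(b - a > 1 for a, b in zip(positions, positions[1:]))


ent_raised = binding_yield(included=["NP525"], excluded=["IPA517"])
print([word for _, word in ent_raised])
print(is_discontinuous(ent_raised))
```

Storing the included and excluded nodes separately, instead of a flat token list, keeps the binding anchored in the constituent structure, so the argument can still be queried by phrasal category.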



4.2. Coping with Modality 

Generally, modal verbs are not considered to be predicates
and are consequently not included in our predicate-argument
database. This can cause a problem when a verbal
predicate that is modified by a modal auxiliary in L1
(3) is represented by a deverbal noun in the corresponding
sentence in L2 (4).

(3) The laws against racism must be harmonised. 9 

(4) Die Harmonisierung der Rechtsvorschriften 
The harmonisation of_the laws 

gegen den Rassismus ist dringend erforderlich. 
against the racism is urgently necessary. 

This can be illustrated by Figure 3: the realisation of the
verbal predicate HARMONISE (harmonised) is modified
by the modal auxiliary must. In the German sentence, the
nominal predicate HARMONISIERUNG (Harmonisierung)
is used. Here, the modality is expressed by a predicate of its
own, namely ERFORDERLICH (erforderlich, 'necessary').
This second predicate does not correspond to any predicate 
in the English sentence. 

It would be an easy way out to resort to annotating 
modal auxiliaries as if they were full verbs and conse- 
quently predicates, but we have opted against this makeshift 
solution. One has to keep in mind that the predicate- 
argument annotation is done monolingually and only later 
serves as the basis for alignment. It should not be assumed 
that the corresponding equivalent is known to the annota- 
tor during the annotation process. Even though the way a 
sentence is expressed in another language can give valu- 
able insights into its structure and meaning, this should not 
go so far as to change the way the original language is an- 
notated. This is particularly true since the idea behind the 
FuSe treebank is that it is in principle extendable and may 
well include languages other than English and German in 
the future. As it cannot be foretold what phenomena will 
be encountered once further languages are added, the deci- 
sions as to what is annotated and what is not should not be 
guided by cross-linguistic considerations.

Thus, the simple fact alone that a predication in one 
language does not correspond to a predication in another 
should not induce one to alter the annotation practice so as to
make the two versions more compatible with each other. 
Modality, in particular, can be expressed in a variety of 
ways, and just because one of them is the realisation as a 
predicative adjective does not make, say, a modal adverbial 
like certainly a predicate. The same argumentation holds 
for modal auxiliaries. 



9 Europarl:de-en/ep-00-01-19.al, 489. 



4.3. Incompatible Predications 

Sometimes, the predications in two corresponding sentences
express approximately the same idea but are otherwise
incompatible with each other. This can be demonstrated
with sentences (5) and (6), the annotation, argument
structure and alignment of which are illustrated in Figure 4.

(5) Our motion will give you a great deal of food for 
thought, Commissioner 10 

(6) Eine Reihe von Anregungen werden wir Ihnen, 
A row of suggestions will we you, 
Herr Kommissar, mit unserer Entschließung
Mr. Commissioner, with our resolution
mitgeben
give

The incompatibility results from the fact that, while the 
predicates GIVE and MITGEBEN are roughly equivalent in 
meaning, the two sentences are organised differently with 
regard to their information structure. This has caused the 
two corresponding argument roles of GIVER and MITGEBER
to be realised by two incompatible expressions representing
different referents (NP500 vs. wir). The English
version is somewhat metaphorical in that, unlike in the German
sentence, there is no animate entity in this agent-like
argument position. The actual agent is not realised as such
and can only be identified by a process of inference based
on the presence of the possessive pronoun our. To complicate
matters even further, the translational equivalent of
NP500 (i.e. the constituent realising the English GIVER) is
not even an argument in the German sentence (PP505).

Consequently, it seems impossible to reach a satisfac- 
tory alignment in this case: either two arguments with the 
same role but different meanings would have to be aligned, 
or else the alignment would rely solely on translational 
equivalence, which would reduce to absurdity our reasons 
for including predicate-argument structure. 

We solve the problem as follows: since cases like this 
are at the same time potentially interesting for contrastive 
analyses and a hazard for applications using the treebank 
for automatic learning, we keep up the alignment on the 
basis of argument roles but tag the alignment (see Section 3.3)
between the arguments in question and thus mark
them as being incompatible (incomp) with each other. 
This enables the interested researcher to formulate explicit 
searches for this alignment type while making it possible 
for applications to skip these cases if this is preferred. 

Sentences (7) and (8) are a second case where we make
use of the possibility to tag the alignment. Here, the adjectival
predicate INAPPLICABLE in (7) is represented by the
negated predicate ANWENDBAR ('applicable') in the German
counterpart (8).

(7) the Directive is inapplicable in Denmark 11

(8) die Richtlinie ist in Dänemark nicht anwendbar
the Directive is in Denmark not applicable 



10 Europarl:de-en/ep-00-01-18.al, 53. 
11 Europarl:de-en/ep-00-01-18.al, 2522.



Since whether or not a predicate is negated does not alter
its argument structure, we do not annotate negation (see
Section 3). As this leads to an alignment of predicates with
opposite meanings, we tag the alignment between the two 
predicates as abs-opp ("absolute opposites"). In theory, 
this method could also be applied to cases where a pred- 
icate is translated by its relational opposite (e. g. buy vs. 
sell). So far, however, we have not yet come across this 
type of translation in our data. It will be interesting to dis- 
cover what types of incompatibility will come to light as 
the annotation proceeds. 
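
How the alignment tags serve both kinds of users can be sketched in a few lines. The record layout below is an assumption made for illustration (the class Alignment and the helper functions are hypothetical); only the tag names incomp and abs-opp and the aligned pairs come from the examples in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    source: str    # English predicate or argument role
    target: str    # German predicate or argument role
    tag: str = ""  # "" (untagged), "incomp" or "abs-opp"


ALIGNMENTS = [
    Alignment("GIVE", "MITGEBEN"),
    Alignment("GIVER", "MITGEBER", tag="incomp"),
    Alignment("INAPPLICABLE", "ANWENDBAR", tag="abs-opp"),
]


def with_tag(alignments, tag):
    """Explicit search for one alignment type, e.g. for contrastive analysis."""
    return [a for a in alignments if a.tag == tag]


def untagged(alignments):
    """Alignments an application can use without special handling."""
    return [a for a in alignments if not a.tag]


print(with_tag(ALIGNMENTS, "incomp"))
print(untagged(ALIGNMENTS))
```

A learner that wants clean training pairs would call untagged, whereas a researcher interested in opposites or shifted information structure would search for the corresponding tag.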

5. Database Structure and Tools 

We use ANNOTATE (Plaehn, 1998a) for the semi-automatic
assignment (Brants, 1999) of POS tags, hierarchical
structure, phrasal and functional tags. ANNOTATE
stores all annotations in a relational database. 12 To stay
consistent with this approach we have developed an extension
to the ANNOTATE database structure to model the
predicate-argument layer and the binding layer.

Due to the monolingual nature of the ANNOTATE
database structure, the alignment layer (Section 3.3) cannot
be incorporated into it. Hence, additional types of
databases are needed. For each language pair (currently,
English and German), an alignment database is defined
which represents the alignment layer, thus fusing two extended
ANNOTATE databases. Additionally, an administrative
database is needed to define sets of two ANNOTATE
databases and one alignment database. The final parallel
treebank will be represented by the union of these sets
(Feddes, 2004).
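
The role of the administrative database can be illustrated with a small sketch using an in-memory SQLite database. The table name treebank_set, its columns and the database names are assumptions made for this example; the actual schema is described in the database documentation and may differ.

```python
import sqlite3

# Hypothetical administrative database: one row per set, where a set consists
# of two extended ANNOTATE databases and one alignment database.
admin = sqlite3.connect(":memory:")
admin.execute(
    """
    CREATE TABLE treebank_set (
        set_id        INTEGER PRIMARY KEY,
        lang_a        TEXT NOT NULL,
        annotate_db_a TEXT NOT NULL,
        lang_b        TEXT NOT NULL,
        annotate_db_b TEXT NOT NULL,
        alignment_db  TEXT NOT NULL
    )
    """
)
admin.execute(
    "INSERT INTO treebank_set "
    "(lang_a, annotate_db_a, lang_b, annotate_db_b, alignment_db) "
    "VALUES (?, ?, ?, ?, ?)",
    ("en", "fuse_en", "de", "fuse_de", "fuse_align_en_de"),
)


def databases_for(lang_a, lang_b):
    """Return (annotate_db_a, annotate_db_b, alignment_db) or None."""
    return admin.execute(
        "SELECT annotate_db_a, annotate_db_b, alignment_db "
        "FROM treebank_set WHERE lang_a = ? AND lang_b = ?",
        (lang_a, lang_b),
    ).fetchone()


print(databases_for("en", "de"))
```

Under this reading, adding a further language pair amounts to registering one more row, which leaves the monolingual databases untouched.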

While annotators use ANNOTATE to enter phrasal and 
functional structure comfortably, the predicate-argument 
structures and alignments are currently entered into a struc- 
tured text file which is then imported into the database. A 
graphical annotation tool for these layers is under devel- 
opment. It will make binding the predicate-argument struc- 
ture to the constituent structure easier for the annotators and 
suggest argument roles based on previous decisions. 

6. Relation to Other Projects and Outlook 

This section will show briefly how our approach relates
to other projects annotating some kind of predicate-argument
structure, such as PropBank (Palmer et al., 2003)
and FrameNet (Johnson et al., 2003), and how the alignment
structures of the parallel treebank make up for certain
drawbacks of our annotation scheme. 

Since our annotation of predicates and their arguments
is not an end in itself but a means of aligning constituents
of a parallel treebank, it is kept deliberately simple.
It resembles the mnemonic descriptors clarifying the
numbered arguments in the PropBank framesets. We do 
not, however, attempt any generalisation whatsoever: neither
do we organise our predicates in frames, as is done by
FrameNet and adopted by SALSA (Erk et al., 2003), nor do
we follow the Levin classes (Levin, 1993), as is done in the
PropBank project.



12 For details about the ANNOTATE database structure see
(Plaehn, 1998b).



Some problems we encounter with our simple scheme 
could be avoided with a deeper predicate-argument structure.
As the first example in Section 4.3 shows, predications
which are incompatible in our scheme need not be
incompatible in a FrameNet-like scheme: if the argument
roles were deeper than our intuitive role names, i.e., if our
motion in example (5) were not a GIVER but, e.g., a CAUSE,
the incompatibility with the corresponding structure in (6)
would not arise.

There are several reasons for us to stick to our sim- 
ple approach. For one thing, a more complex scheme 
would make the annotation more susceptible to inconsis- 
tencies. Secondly, transferring the approaches mentioned 
above to other languages than English is not a straightfor- 
ward matter. While this seems to be working quite well 
for the FrameNet frames (Erk et al., 2003), Levin's verb
classes are inherently English and cannot be directly ap- 
plied to German. In a later stage of the project, it might be 
possible to work through the predicate-argument database 
and map our very specific scheme to a more general one, 
e. g. by assigning each predicate to a frame and each ar- 
gument to a frame element. However, other studies show 
that mapping one scheme onto another is far from trivial 
(Hajičová and Kučerová, 2002), and quite a lot of manual
work will presumably be necessary. 

Finally, we believe it is possible to exploit the corpus 
as a parallel lexical resource to see how different pred- 
icates can be clustered automatically by analysing their 
mappings in the other language. Figure 5 sketches the
general idea. Suppose that in the English sub-corpus, 
two predicate-argument structures have different predicates 
(BUY and PURCHASE) which subcategorise for comparable 
arguments and express the same concept. In a FrameNet- 
like annotation, these predicates would be instantiations of 
the same frame (e.g. COMMERCIAL_TRANSACTION). In
our scheme, neither are these predicates grouped in any 
way, nor do the comparable arguments get the same role 
names. 

However, it is well conceivable that both predicates 
are translated identically in the corresponding German 
structures (e.g. by KAUFEN 'buy'). Since predicates 
and arguments are aligned to each other, the compara- 
bility of the predicates (BUY - PURCHASE) and their 
arguments (BUYER - PURCHASER and ENT_BOUGHT -
ENT_PURCHASED) can be derived (cf. the dashed lines). 
It will then be instructive to investigate how these clusters 
compare to FrameNet frames and to explore to what extent 
such a data-driven approach to frame semantics is feasible. 
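
The clustering idea can be sketched as grouping source-language items by a shared alignment target. This is a deliberately simple illustration, not a method we have implemented: the function name and the list of aligned pairs are hypothetical, and grouping by a single shared target ignores chains of alignments, which a real procedure would need to handle (e.g. with a union-find structure).

```python
from collections import defaultdict

# Illustrative (English, German) alignments in the spirit of Figure 5.
# The NOMINATE pair is added to show an item that forms no cluster.
ALIGNED = [
    ("BUY", "KAUFEN"),
    ("PURCHASE", "KAUFEN"),
    ("BUYER", "KÄUFER"),
    ("PURCHASER", "KÄUFER"),
    ("ENT_BOUGHT", "GEKAUFTES"),
    ("ENT_PURCHASED", "GEKAUFTES"),
    ("NOMINATE", "NOMINIERUNG"),
]


def clusters(aligned_pairs):
    """Group source items that share an alignment target in the other language."""
    by_target = defaultdict(set)
    for source, target in aligned_pairs:
        by_target[target].add(source)
    return sorted(
        sorted(members) for members in by_target.values() if len(members) > 1
    )


for cluster in clusters(ALIGNED):
    print(cluster)
```

Each resulting cluster is a candidate for comparison with a FrameNet frame or frame element; items with a unique translation, such as NOMINATE here, remain unclustered.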

Figure 1: Alignment of a verb/direct-object construction with a noun/modifier construction

Figure 2: Complex constituent binding of an argument

Figure 3: Modality

Figure 4: Incompatible predications



          Sentence X                    | Sentence Y

English   BUYER   BUY-v     ENT_BOUGHT  | PURCHASER  PURCHASE-v  ENT_PURCHASED

German    KÄUFER  KAUFEN-v  GEKAUFTES   | KÄUFER     KAUFEN-v    GEKAUFTES



Figure 5: Deriving predicate clusters by exploiting alignment structures 



